ABSTRACT

The channel proteins belonging to the major intrinsic proteins (MIP)
superfamily are diverse and are found in all forms of life. Water-transporting
aquaporin and glycerol-specific aquaglyceroporin are the prototype members of
the MIP superfamily. MIPs have also been shown to transport other neutral
molecules and gases across the membrane. They have internal homology and
possess conserved sequence motifs. By analyzing a large number of publicly
available genome sequences, we have identified more than 1000 MIPs from
diverse organisms. We have developed a database, MIPModDB, which will be a
unified resource for all MIPs. For each MIP entry, this database contains
information about the source, gene structure, sequence features, substitutions
in the conserved NPA motifs, structural model, the residues forming the
selectivity filter and channel radius profile. For a selected set of MIPs, it
is possible to derive a structure-based sequence alignment and evolutionary
relationship. Sequences and structures of selected MIPs can be downloaded from
the MIPModDB database, which is freely available at
http://bioinfo.iitk.ac.in/MIPModDB.

INTRODUCTION 

Major intrinsic proteins (MIPs) form one of the largest superfamilies of
channel proteins (1,2). They transport water, neutral solutes such as glycerol
and urea, metalloids including antimonite and arsenite, and gases like CO2,
nitric oxide and ammonia across membranes (3-5). Water-transporting aquaporin
and glycerol-specific aquaglyceroporin are the prominent members of this
family. Members of the MIP superfamily are involved in vital physiological
processes such as skin moisture, gastrointestinal fluid transport, fat
metabolism, epidermal proliferation, maintaining corneal and lens transparency
in eyes and water homeostasis in the kidney and central nervous system (6-11).
MIPs are found from bacteria to humans, and a large number of diverse MIP
members have been identified, especially in plants (12-17). The abundant MIPs
identified in higher plants can be classified into at least five major
subfamilies (15). In humans, they are implicated in several diseases such as
nephrogenic diabetes insipidus, acute and chronic renal failure, brain edema,
cataract and arsenic toxicity (18-24). Sequence analysis of MIP members
clearly revealed the presence of the highly conserved Asn-Pro-Ala (NPA)
signature sequence motif (25,26). MIP sequences also possess internal homology
in which the N- and C-terminal halves have significant sequence similarity
(25). At the structural level, the N-terminal half is related to the
C-terminal half by a pseudo-2-fold symmetry (27,28). Three-dimensional
structures of more than 10 MIPs from different sources such as mammals
(29-31), plants (32), Escherichia coli (33), yeast (34) and archaea (35) have
been determined. They all adopt a unique hourglass fold even when the sequence
identity among them is very low. Two regions of constriction have been
identified within the channel. The two NPA motifs form the central
constriction. The outer constriction, toward the extracellular side, is known
as the aromatic/arginine (ar/R) selectivity filter and is formed by four
residues. Both regions are known to play an important role in solute transport
and selectivity (36-39). With their involvement in human physiology and
pathophysiology and with the available structural knowledge, members of the
MIP family are being considered as attractive drug targets (40,41). For
example, the aquaglyceroporin of the intracellular parasite Plasmodium
falciparum and knowledge of its structure (42) have opened new options for
novel malaria therapies (43). However, for the majority of MIPs from a large
number of organisms, the tissue localization, functional properties and
biological significance are either not known or not clearly understood.
With the advent of a new generation of sequencing technologies (44,45), genome
sequences of a large number of organisms are available. Using conserved
sequence motifs and the internal sequence similarity as constraints, we have
previously searched genome sequences of plants to identify and characterize
the MIP proteins (15,16). We have extended this approach to genomes of other
organism groups and identified more than 1000 MIP sequences in diverse
organisms. The wealth of information on MIP sequences is now stored in a
database called MIPModDB, and various details about the substitutions in the
conserved NPA motif, gene structure and the features obtained from structural
models are available for each MIP sequence. With diverse MIP sequences, one
can also obtain a structure-based sequence alignment and evolutionary
relationship for a given set of MIP sequences.

DATABASE CONSTRUCTION 

MIP genes were identified using BLAST (46) from completed and partial genomes
available in the NCBI database, following the protocol described in our
previous studies (15,16). Thirteen human aquaporins and five plant MIPs
belonging to the five major plant subfamilies were used as query sequences.
Short sequences and sequences with missing transmembrane segments or important
loop regions were discarded. We also found that some MIPs have been wrongly
annotated. For example, an MIP from Salinispora tropica is annotated as 'low
molecular weight phosphotyrosine protein phosphatase' (NCBI accession:
YP_001157491). MIPs from plants identified in previous work (13-16,47) were
also included. We also searched the motif-oriented database MIPDB (48) and
considered those MIP sequences not identified in the above BLAST search. The
final data set contained 1008 MIP sequences from 341 different organisms. For
each MIP sequence, the presence of the conserved NPA sequence motif and
internal sequence similarity were examined, and the sequences were also
confirmed against pattern databases like Pfam (49) and PROSITE (50). The
three-dimensional structure of each MIP sequence was modeled by homology
modeling using the same protocol that was applied earlier for plant MIPs
(15,16). Structures of bovine AQP1, E. coli GlpF and archaeal AQPM were used
as template structures, and their PDB (51) IDs are 1J4N, 1FX8 and 2F2B,
respectively. The channel radius profile was calculated using the HOLE program
(52) as described previously (16). The contents of this database can thus be
largely categorized into sequence and structural data, which are explained in
more detail in the following sections. The important statistics related to the
sequence and structure data of MIPModDB are given in Table 1. Both the
sequence and structure data for a representative MIP sequence are shown in
Figure 1.

MIP SEQUENCE DATA 

Table 1. Important statistics of MIPModDB

Number of MIP sequences                                          1008
Number of organisms                                               341
Substitutions in the NPA motifs                                   219
    Only in the first NPA motif (loop B)                           74
    Only in the second NPA motif (loop E)                          82
    Both NPA motifs                                                63
MIPs with selectivity filter similar to water channels (a)        349
    (FHTR + FHAR + FHCR + FHSR)
MIPs with selectivity filter similar to glycerol channels (a)     170
    (WGYR + WGFR + WGWR)
Experimentally determined MIP structures                           38

(a) The selectivity filter is formed by four residues and the corresponding
amino acids are given in one-letter codes. The first and second residues come
from the second and the fifth transmembrane segments, respectively. The other
two residues are contributed by loop E. See text for details.

In general, only a single MIP has been identified in most of the microbial
genomes. Plants have a large number of MIPs in comparison to animals. For each
MIP, its source and the NCBI accession ID are given. Each MIP sequence is also
given a unique identifier derived from the scientific name of its source
organism. The first two characters are taken from the genus and the next four
characters from the species name, followed by a four-digit unique number.
Wherever it is available, the UNIPROT (53) accession ID is also provided.
Apart from the primary structure information, sequence data includes the
exon-intron organization of the gene, substitution (if any) in the conserved
NPA motif and the percentage sequence similarity with the template sequences
that are used in the homology modeling procedure to build three-dimensional
models. Sequence similarity between a given MIP sequence and the template
sequences was calculated using the program NEEDLE as available in the EMBOSS
suite of programs (54). Only the modeled part of the target MIP sequence was
considered for this purpose.
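The identifier scheme described above can be illustrated with a minimal
sketch. The function name make_mip_id, the upper-casing and the idea of passing
the four-digit number as an argument are assumptions made for illustration;
the text does not say how the number is assigned. The example reproduces the
identifier HOSAPI0405 shown for human aquaporin 1 in Figure 1.

```python
def make_mip_id(scientific_name: str, serial: int) -> str:
    """Build an identifier from a binomial name and a four-digit number.

    Layout (from the text): 2 characters of the genus, 4 characters of the
    species name, then a four-digit number. Upper-casing is an assumption
    based on the example identifier HOSAPI0405.
    """
    parts = scientific_name.split()
    if len(parts) < 2:
        raise ValueError("expected a 'Genus species' name")
    if not 0 <= serial <= 9999:
        raise ValueError("serial must fit in four digits")
    genus, species = parts[0], parts[1]
    return f"{genus[:2]}{species[:4]}{serial:04d}".upper()


# The serial 405 is taken from the example identifier in Figure 1.
print(make_mip_id("Homo sapiens", 405))
```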

Gene structure 

For each MIP, the gene structure is represented as a graphical diagram. It
gives the length and the positions of the introns with respect to the
secondary structures of the corresponding MIP. The red and blue vertical bars
indicate the starting positions of helices and loops B/E, respectively. The
positions of individual introns were extracted from the NCBI database
annotations. Transmembrane segments are marked based on the modeled structures
(see below). Knowledge of intron-exon organization helps to understand the
evolution of MIPs across different organisms and of MIP subfamilies within the
same species. For example, in plants it has been shown that the number and
positions of introns are conserved within a given MIP subfamily (14-16).

Substitutions in NPA motif 

In addition to their role in solute transport and selectivity (36,55-57), the
highly conserved NPA motifs and substitutions within them seem to be important
for other functions such as protein targeting (58) and full expression of the
protein (59). In our data set, substitutions in only the first NPA motif occur
in 74 examples. In 82 cases, substitutions are found only in the second NPA
motif. Both NPA motifs are substituted in 63 MIPs (Table 1). In total,
substitutions in at least one of the NPA motifs are found in about 22% of the
MIPs in our data set. The majority of the substitutions involve mutation of
either Pro or Ala of the NPA motif. Only a handful of examples (fewer than 16)
are found in which Asn is mutated, indicating its important role both in
structure, as a helix-capping residue, and in function, as a residue
responsible for cation exclusion, as demonstrated in recent studies (60).

Figure 1. Screenshot of the MIPModDB protein page for a representative MIP,
human aquaporin 1 (identifier HOSAPI0405, NCBI accession NP_932766.1).
Information about gene structure, substitutions in the NPA motifs, residues
forming the ar/R selectivity filter, sequence similarity with the templates
and the RMSD calculated for the modeled structure with the three template
structures are some of the features reported for a given MIP in the protein
page.
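The NPA substitution categories and the quoted percentage can be checked with
a short sketch. The counts are those reported in the text and in Table 1; the
function npa_category and its category labels are hypothetical names
introduced here, and the sketch assumes the two motif strings (loop B and
loop E) have already been extracted for each sequence.

```python
def npa_category(loop_b_motif: str, loop_e_motif: str) -> str:
    """Classify an MIP by which of its two NPA motifs is substituted."""
    first_substituted = loop_b_motif.upper() != "NPA"
    second_substituted = loop_e_motif.upper() != "NPA"
    if first_substituted and second_substituted:
        return "both"
    if first_substituted:
        return "first_only"
    if second_substituted:
        return "second_only"
    return "conserved"


# Counts reported in Table 1.
counts = {"first_only": 74, "second_only": 82, "both": 63}
total_mips = 1008

substituted = sum(counts.values())
percent = round(100 * substituted / total_mips, 1)

print(substituted)  # 219
print(percent)      # 21.7, i.e. "about 22%"
```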

DATA FROM MIP STRUCTURAL MODELS 

Structure-based data includes the atomic model obtained using the homology
modeling procedure, the residues that form the ar/R selectivity filter, the
structure-based sequence alignment, the conservation of residues at the
helix-helix interface and the HOLE radius profile. The structure-based details
also include the root mean square deviation (RMSD) calculated between the
modeled MIP and each of the template structures using the program DALI (61).
The superposed figures are available in two different orientations
(Figure 2A). Many MIPs have long N- and C-terminal extensions, and hence the
start and end positions of the polypeptide segment used in the homology
modeling method are given.

Aromatic/arginine selectivity filter 

Four residues form the outer constriction nearly 8 Å from the conserved NPA
motif toward the extracellular side. These residues are contributed by the
second and fifth TM segments and loop E. This aromatic/arginine (ar/R)
selectivity filter has been implicated in obstructing proton conduction (62)
and in efficient solute transport (37,63,64). The four ar/R selectivity filter
residues, represented by their one-letter codes, are given for each MIP. For
example, 'FIIR' indicates that the first and second residues, Phe and Ile, are
from TM2 and TM5, respectively, and the last two are the loop E residues.
Analysis of the selectivity filter residues indicates that 349 MIPs have a
selectivity filter (FHTR, FHAR, FHCR or FHSR) similar to that found in
water-selective aquaporin channels (FHTR or FHCR). The number of MIPs having a
selectivity filter (WGYR, WGFR or WGWR) similar to that of the
glycerol-specific aquaglyceroporin (WGFR) is 170. Thus, about 50% of the MIPs
in the database have a selectivity filter typical of aquaporin or
aquaglyceroporin. The remaining half have substitutions that can alter the
size and chemical nature of the outer constriction, which will have a major
influence on the nature of the solute transported by the channel. The channel
diameters of the water channel and the aquaglyceroporin at the ar/R
constriction are 2.0 and 3.5 Å, respectively. The selectivity filter residues
of the predicted MIP model superposed on those of the pure water channel from
bovine aquaporin and of the aquaglyceroporin from E. coli are available
(Figure 2B). The HOLE radius profiles of all three channels, namely the water
channel, the glycerol channel and the modeled MIP channel, can be compared
(Figure 2C). This gives an idea of the size of the region around the ar/R
selectivity filter relative to both the water channel and the
aquaglyceroporin, and helps the user to predict the possible size of a solute
that can pass through this constriction.

Figure 2. (A) Superposition of the modeled MIP structure with 1J4N, the
structure of bovine aquaporin, shown in two different orientations, namely
parallel (left) and perpendicular (right) to the channel axis. (B) The four
residues of the ar/R selectivity filter superposed on those of the 1J4N
structure. (C) Comparison of HOLE radius profiles plotted for the water
channel (green), glycerol facilitator (blue) and the modeled MIP structure
(red). The ar/R selectivity filter is located at approximately -10 Å.
(D) Phylogenetic tree calculated for all MIPs of a representative organism,
Phytophthora infestans, using the parsimony method.
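The grouping of selectivity filters used in this section can be expressed as a
small lookup. This is only a sketch: the function classify_filter and its
labels are hypothetical, while the two sets of four-letter codes and the
counts are those given in the text and in Table 1.

```python
# Four-letter ar/R codes listed in the text (TM2, TM5, loop E, loop E).
WATER_LIKE = {"FHTR", "FHAR", "FHCR", "FHSR"}
GLYCEROL_LIKE = {"WGYR", "WGFR", "WGWR"}


def classify_filter(residues: str) -> str:
    """Assign an ar/R selectivity filter code to one of three groups."""
    code = residues.upper()
    if len(code) != 4:
        raise ValueError("an ar/R filter is described by four residues")
    if code in WATER_LIKE:
        return "water-channel-like"
    if code in GLYCEROL_LIKE:
        return "glycerol-channel-like"
    return "other"


# Share of MIPs with a filter typical of aquaporin or aquaglyceroporin,
# using the counts from Table 1.
typical = 349 + 170
share = round(100 * typical / 1008, 1)

print(classify_filter("FHCR"))  # water-channel-like
print(typical, share)           # 519 51.5, i.e. "about 50%"
```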

Structure-based sequence alignment 

As mentioned earlier, MIP sequences are diverse. For example, the sequence
identities between some of the plant MIP subfamilies are as low as 20%
(13,15). In such cases, programs such as ClustalW (65) that generate a
multiple sequence alignment of a given set of sequences are unlikely to
produce a meaningful alignment for a diverse set of MIP sequences. Instead, if
the structurally equivalent positions belonging to the TM helical segments are
aligned, they are likely to show high conservation and indicate the importance
of residues in certain positions. A structure-based sequence alignment of all
the TM segments and the functionally important loops B and E is provided for
all the MIPs present in the same organism, along with the template sequences.
We have also previously identified 17 positions that occur in the helix-helix
interface. They are occupied by small and weakly polar residues and have been
shown to be highly conserved if the amino acids Ala, Thr, Ser, Cys and Gly are
considered as a group (15,16). These positions, along with the residues that
form the ar/R selectivity filter, are highlighted in the structure-based
sequence alignment. An example of a structure-based sequence alignment as
obtained from the MIPModDB database is shown in Figure 3.
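The idea of group conservation at the helix-helix interface can be sketched as
follows. The residue group (Ala, Thr, Ser, Cys, Gly) comes from the text; the
function names and the toy alignment are assumptions made for illustration and
are not taken from MIPModDB.

```python
from collections import Counter

# Small and weakly polar residues treated as one group (from the text).
SMALL_WEAKLY_POLAR = frozenset("ATSCG")


def alignment_column(alignment: list[str], index: int) -> str:
    """Return one column of an alignment given as equal-length strings."""
    return "".join(seq[index] for seq in alignment)


def group_conservation(column: str) -> float:
    """Fraction of non-gap residues that belong to the small/weakly polar group."""
    residues = [r for r in column.upper() if r != "-"]
    if not residues:
        return 0.0
    return sum(r in SMALL_WEAKLY_POLAR for r in residues) / len(residues)


def identity_conservation(column: str) -> float:
    """Fraction of non-gap residues equal to the most common residue."""
    residues = [r for r in column.upper() if r != "-"]
    if not residues:
        return 0.0
    return Counter(residues).most_common(1)[0][1] / len(residues)


# Toy alignment (hypothetical sequences, three columns).
toy = ["GAL", "SAV", "AGI", "TSL"]
first = alignment_column(toy, 0)      # "GSAT"
print(identity_conservation(first))   # 0.25: no single residue dominates
print(group_conservation(first))      # 1.0: fully conserved as a group
```

A column such as the first one above looks variable by identity alone but is
fully conserved once the five residues are treated as a single group.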

Construction of phylogenetic tree for selected MIPs 

The database provides an interface through which the user can analyze the
evolutionary relationship among a selected group of MIPs. To construct a
phylogenetic tree, the user needs to select at least three sequences. The user
can choose one of three methods, namely the neighbor-joining method, the
maximum likelihood method and the maximum parsimony method. In addition to the
complication of generating a multiple sequence alignment of diverse MIP
sequences, many MIPs have long N- and C-termini as well as long loops
connecting the TM segments. Hence, to avoid errors while constructing the
evolutionary tree, the input used for this purpose is the structure-based
sequence alignment of the TM helical regions and loops B and E. The program
PHYLIP, which is part of the EMBOSS suite of programs (54), is used by the
server to create the phylogenetic trees (Figure 2D).
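A minimal sketch of how the tree input described above could be prepared is
given below. It only illustrates the two constraints stated in the text (at
least three sequences, and use of the TM and loop B/E columns only); the
function tree_input, the method labels and the column ranges are hypothetical
and do not describe the server's actual code.

```python
METHODS = {"neighbor-joining", "maximum-likelihood", "maximum-parsimony"}


def tree_input(aligned: dict[str, str],
               regions: list[tuple[int, int]],
               method: str) -> dict[str, str]:
    """Keep only the aligned columns inside the given regions.

    aligned: name -> aligned sequence (all of equal length).
    regions: half-open (start, end) column ranges, e.g. TM helices and
             loops B and E in the structure-based alignment.
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    if len(aligned) < 3:
        raise ValueError("at least three sequences are required")
    if len({len(seq) for seq in aligned.values()}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return {
        name: "".join(seq[start:end] for start, end in regions)
        for name, seq in aligned.items()
    }


# Toy aligned sequences and column ranges (hypothetical).
toy = {
    "mipA": "MKAGSLNPAVTW",
    "mipB": "MRAGTLNPSVTW",
    "mipC": "M-SGSINPAITF",
}
trimmed = tree_input(toy, regions=[(2, 5), (6, 9)], method="neighbor-joining")
print(trimmed["mipA"])  # AGSNPA
```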

IMPLEMENTATION 

The information on MIPs is maintained as a relational database using MySQL
(http://www.mysql.com), which allows easy storage and access. The database is
hosted on a web server running Apache (http://www.apache.org/) on the Fedora
Core Linux platform and can be queried through a web interface implemented in
the PHP (v5.2) scripting language (http://www.php.net/).

ACCESS TO MIP DATA 

MIPModDB allows users to browse, retrieve and query the database. The
Statistics page lists all the MIPs in three different categories: (i) based on
the NPA sequence motif, (ii) based on the selectivity filter residues and
(iii) grouped by organism. Users can browse MIPs according to the conservation
or substitutions found in the NPA boxes. One can also look for MIPs with
particular residues forming the selectivity filter. MIPs are also organized by
the organism in which they occur. Organisms are listed in descending order of
the number of MIPs, with the model tree Populus trichocarpa at the top with
the maximum of 55 MIPs.

The database interface allows the user to retrieve and identify MIPs using
various features of MIPs as a query. It can be searched by unique accession
and by complete or partial amino acid sequence. The search facility also
allows the user to select MIP(s) with particular residues in the selectivity
filter. Alternatively, MIPs with specific substitutions in the NPA motif can
be queried. A query based on an organism name will retrieve all MIP sequences
from a particular species. More than one MIP feature can also be used to
narrow down the search. More detailed information can be retrieved by
following the associated links. In addition to searching for specific MIPs,
the database enables users to download, in FASTA format, all the MIP sequences
from a given organism or all the sequences that have the same ar/R selectivity
filter residues. MIP sequences with specific substitutions in the NPA motif
can also be downloaded. Sequence alignments used to generate the
three-dimensional models can be downloaded in PIR or PAP format. For each MIP,
the coordinates of the model are available in the PDB format. Similarly, the
phylogenetic tree for a selected set of MIPs can be downloaded.
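The combined search and FASTA download described above can be sketched with a
small relational example. This is not the MIPModDB schema: the table and
column names, the identifiers, the organism names and the sequences are all
hypothetical, and Python's built-in sqlite3 module is used here as a
stand-in for the MySQL server so that the sketch is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE mip (
           mip_id      TEXT PRIMARY KEY,
           organism    TEXT,
           ar_r_filter TEXT,
           npa_loop_b  TEXT,
           npa_loop_e  TEXT,
           sequence    TEXT
       )"""
)
# Placeholder records; none of these are real database entries.
conn.executemany(
    "INSERT INTO mip VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("EXAMPL0001", "Examplus primus", "FHTR", "NPA", "NPA", "MASEFKKKLF"),
        ("EXAMPL0002", "Examplus primus", "WGFR", "NPA", "NPS", "MSQTLKGQCI"),
        ("DEMOSP0001", "Demo species", "FHTR", "NPA", "NPA", "MGRKLLEVAE"),
    ],
)


def search(conn, organism=None, ar_r_filter=None):
    """Return (mip_id, sequence) rows matching every criterion supplied."""
    clauses, params = [], []
    if organism is not None:
        clauses.append("organism = ?")
        params.append(organism)
    if ar_r_filter is not None:
        clauses.append("ar_r_filter = ?")
        params.append(ar_r_filter)
    sql = "SELECT mip_id, sequence FROM mip"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY mip_id"
    return conn.execute(sql, params).fetchall()


def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text."""
    lines = []
    for mip_id, seq in records:
        lines.append(f">{mip_id}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Narrow the search with two features at once, then export as FASTA.
hits = search(conn, organism="Examplus primus", ar_r_filter="FHTR")
print(to_fasta(hits), end="")
```

Passing the criteria as bound parameters rather than formatting them into the
SQL string keeps user-supplied search terms from being interpreted as SQL.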



COMPARISON WITH OTHER TRANSPORTER 
DATABASES 

There are databases developed specifically for membrane proteins that are
involved in transporting solutes across membranes. The database TransportDB
(66) provides the complete list of transporters for a given organism. For
example, in the case of humans, the types of transporters listed include
ATP-dependent transporters, ion channels, secondary transporters and
unclassified transporters. It also lists all outer membrane porins and
channels from different organisms. The other database, the Transporter
Classification Database (TCDB) (67), is a classification system for membrane
transporters. The superfamilies of transporters in this database include
channel-forming toxins and peptides, transporters, symporters, antiporters,
porins, carriers and ion channels. Neither of these databases includes all the
members of the MIP superfamily, which are known to predominantly transport
neutral solutes. For example, while TransportDB lists human MIPs, MIP members
from other organisms are not found. TCDB does not seem to include the MIP
superfamily, although some MIP sequences are found. Moreover, while the above
two databases largely contain sequence information and known PDB structures,
MIPModDB provides structural models and associated information for more than
1000 MIP sequences.

FUTURE DIRECTIONS 

Few MIPs have been functionally well characterized, and experimental studies
are being carried out to determine the solute transport properties of many
more MIPs. In the next version of the MIPModDB database, functional properties
of MIPs, post-translational modifications and cellular localization will be
annotated wherever available, and the related literature will be linked
through PUBMED. As new MIPs are being recognized, users will have the option
to upload sequences in a future version. Ultimately, our software will follow
a pipeline procedure that takes an MIP sequence through a series of steps,
including extracting the sequence features, building a homology model,
identifying the selectivity filter residues, generating the HOLE radius
profile and possibly predicting the solutes that are likely to be transported.

Figure 3. Structure-based sequence alignment for a selected set of MIPs. This
alignment is always produced together with the six high-resolution MIP
structures, whose PDB IDs are also shown. It covers the six transmembrane
segments and the two functionally important loops B and E. The residues
forming the ar/R selectivity filter are shown with a dark brown background.
Seventeen positions previously identified to occur in the helix-helix
interface (16) are highly group conserved when small and weakly polar residues
are considered together as a group. They are displayed with a cyan background.